In our previous work\(^1\), we illustrated the strengths of UMAP in dimension reduction. Though it generated visual clusters, extracting cluster assignments using an automated method was challenging. We show that HDBSCAN(\(\hat{\epsilon}\)), it is possible to extract very accurate clusters in a meaningful way.
The visualizations provided below were generated by running UMAP on Thousand Genomes Project (1000GP) genotype data (for the demonstration and manuscript, see: https://github.com/diazale/1KGP_dimred/). Specifically, we used UMAP on the top 100 principal components, reduced to two dimensions, parameterized with nearest_neighbours=15 and minimum_distance=0.5. There are two colouring schemes used:
The clusters were generated by running HDBSCAN(\(\hat{\epsilon}\)) with minimum_points=25, \(\hat{\epsilon}=0.5\) on a UMAP projection of the top 16 principal components, reduced to five dimensions, parameterized with nearest_neighbours=50 and minimum_distance=0.01. They were generated using the Python packages for umap\(^2\) and hdbscan\(^3\).
All projections used in this project are provided in the github repo (https://github.com/diazale/popgenclust)
s <- 0.1
a <- 0.4
f <- "hdbscan_labels_min25_EPS0.5_1000G_UMAP_PC16_NC5_NN50_MD0.01_euclidean_2019814225811"
clusters <- import_clusters(cluster_dir, f)
main_data <- data.frame(cbind(ids, coords))
colnames(main_data)[1] <- "ID"
main_data$ID <- as.character(main_data$ID)
main_data <- main_data[,-c(2)]
# add population labels
main_data <- merge(main_data, samples, by.x = "ID", by.y = "sample")
# add cluster labels
main_data$cluster <- clusters$cluster
pop_means <- get_cluster_means(main_data, "x1", "x2", "pop")
# Note that in this 2D visualization, the ITU and GIH populations are split into multiple clusters
pop_means <- subset(pop_means, !(pop %in% c("ITU","GIH")))
pop_means <- rbind.data.frame(pop_means, list("ITU",-7,5))
pop_means <- rbind.data.frame(pop_means, list("ITU",15,16))
pop_means <- rbind.data.frame(pop_means, list("GIH",-7,5))
pop_means <- rbind.data.frame(pop_means, list("GIH",2,15))
ggplot(main_data, aes(x=x1, y=x2, colour=pop)) +
geom_point(size=s, alpha=a) +
scale_color_manual(name = "pop", values=pal) +
ggtitle("Thousand Genomes, coloured by population") +
xlab("UMAP1") + ylab("UMAP2") + theme_bw() +
theme(legend.position = "none") +
geom_label_repel(data = pop_means,
aes(x=x1, y=x2, label=pop, colour = as.factor(pop)), size=5, alpha=0.6, label.size=0.05)
# In this visualization, cluster 16 (mostly ITU individuals) is split into multiple clusters
cluster_means <- get_cluster_means(main_data, "x1", "x2", "cluster")
cluster_means <- subset(cluster_means, cluster!=16)
cluster_means_extra <- cluster_means[FALSE,]
cluster_means_extra <- rbind.data.frame(cluster_means_extra, data.frame(cluster="16",x1=-4.3,x2=5))
cluster_means_extra <- rbind.data.frame(cluster_means_extra, data.frame(cluster="16",x1=15,x2=14))
# Colour palette
colourCount <- length(unique(main_data$cluster))
getPalette = colorRampPalette(brewer.pal(9, "Set1"))
colpal <- colorRampPalette(brewer.pal(8, "Set1"))(colourCount)
# manually change some colours:
# 14 is too yellow
colpal[15] <- "#FFBC2D"
# randomly permute the colours for clarity.
# sequential clusters are often (1) next to each other and (2) similarly coloured.
colpal <- sample(colpal)
# Some colours too similar:
# 11 and 12, 9 and 10,
ggplot(main_data, aes(x=x1, y=x2, colour=cluster)) +#, shape=cluster)) +
geom_point(size=s, alpha=a) +
scale_color_manual(values = colpal) +
geom_label_repel(data = cluster_means,
aes(x=x1, y=x2, label=cluster, colour = as.factor(cluster)), size=5, alpha=0.6) +
geom_label(data = cluster_means_extra,
aes(x=x1, y=x2, label=cluster, colour = as.factor(cluster)), size=5, alpha=0.6) +
ggtitle("Thousand Genomes, coloured by clustering algorithm") +
xlab("UMAP1") + ylab("UMAP2") + theme_bw() +
theme(legend.position = "none")
Clearly, there is a very strong correlation between the provided population labels and the extracted clusters.
Though useful, the static plots can obfuscate some details. Here we provide interactive plots, coloured by population and by clusters extracted. Hover over the points to get details about populations and clusters!
s <- 2
main_data <- data.frame(cbind(ids, coords))
colnames(main_data)[1] <- "ID"
main_data$ID <- as.character(main_data$ID)
main_data <- main_data[,-c(2)]
# add population labels
main_data <- merge(main_data, samples, by.x = "ID", by.y = "sample")
# add cluster labels
main_data$cluster <- clusters$cluster
# unassigned boolean
# The clustering for this demo has been selected to not have any unclustered individuals
main_data$unassigned <- ifelse(main_data$cluster==-1,"unassigned","assigned")
colourCount <- length(unique(main_data$cluster))
getPalette = colorRampPalette(brewer.pal(9, "Set1"))
fig_main <- plot_ly(data = main_data, x = ~x1, y = ~x2,
marker = list(size=s)) %>%
add_markers(color = ~pop, colors = pal,
text = ~paste( "ID", ID, "\nCluster", cluster),
type = "scattergl", mode = "markers")
fig_main
fig_clusters <- plot_ly(data = main_data, x = ~x1, y = ~x2,
marker = list(size=s)) %>%
add_markers(color = ~cluster, colors = getPalette(colourCount),
text = ~paste( "ID", ID, "\nPopulation", pop),
type = "scattergl", mode = "markers")
fig_clusters
This chunk of code looks at specific populations to see how many of their individuals wind up in different clusters. Most populations have \(>95\%\) of individuals being assigned to a single cluster. Notable exceptions, in this parameterization, include the Indian Telugu in the UK (ITU) and Gujarati Indians of Houston (GIH). Different parameterizations in the clustering algorithm as well as the underlying UMAP projection will result in different cluster extractions.
print_pop_table <- function(p, in_data){
# Print the cluster membership of a specified population
cat(paste(p, "below"))
temp_data <- in_data %>% filter(pop==p)
print(table(temp_data$cluster))
cat("Proportion of population in largest cluster:")
print(max(table(temp_data$cluster))/sum(table(temp_data$cluster)))
cat("\n")
}
for (p in unique(main_data$pop)){
print_pop_table(p, main_data)
}
## GBR below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 105 0 0 0 0 0 0 0 0 0 0 0 0 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 1
##
## FIN below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 104 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 1
##
## CHS below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 171
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 1
##
## PUR below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 145 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 0.9731544
##
## CDX below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## 18 19 20
## 104 0 0
## Proportion of population in largest cluster:[1] 0.9904762
##
## CLM below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 0 0 0 0 0 4 142 0 0 0 0 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 0.9726027
##
## IBS below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 162 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 1
##
## PEL below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 0 128 0 0 0 0 1 0 0 0 0 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 0.9922481
##
## PJL below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 48 95 12 0 0 0 0
## Proportion of population in largest cluster:[1] 0.6129032
##
## KHV below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3
## 18 19 20
## 118 0 0
## Proportion of population in largest cluster:[1] 0.9752066
##
## ACB below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 122 0 0 0 0 0 0 0 0 0 0 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 1
##
## GWD below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 18 19 20
## 0 179 1
## Proportion of population in largest cluster:[1] 0.9944444
##
## ESN below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 0 0 0 172 0 0 0 0 0 0 0 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 1
##
## BEB below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 0 0 133 0 0 0 0 0 0 0 10 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 0.9300699
##
## MSL below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 18 19 20
## 0 0 122
## Proportion of population in largest cluster:[1] 1
##
## STU below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 124 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 0.96875
##
## ITU below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 109 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 0.9237288
##
## CEU below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 183 0 0 0 0 0 0 0 0 0 0 0 0 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 1
##
## YRI below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 0 0 0 1 181 0 0 0 0 0 0 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 0.9945055
##
## CHB below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 105
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 1
##
## JPT below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 0 0 0 0 0 0 0 104 0 0 0 1
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 0.9904762
##
## LWK below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 110 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 1
##
## ASW below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 103 0 0 0 0 1 3 0 0 0 0 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 0.9626168
##
## MXL below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 0 0 0 7 0 0 0 0 0 0 0 97 0 0 0 0 0 0 0 0 0
## Proportion of population in largest cluster:[1] 0.9326923
##
## TSI below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 111 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 1
##
## GIH below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 0 0 0 0 0 69 0 0 0 0 0 0 0 0 0 41 1 0 0 0 0
## Proportion of population in largest cluster:[1] 0.6216216